Prediction of cytokine signaling activity

In this tutorial, we utilize the GSE154109 dataset, which contains single-cell RNA sequencing (scRNA-seq) data from acute myeloid leukemia (AML) patients, to demonstrate how dimension reduction enhances cytokine activity inference. Cell type annotations were obtained from the TISCH2 database.

loading packages

import warnings
warnings.filterwarnings("ignore")

import scanpy as sc
import numpy as np
import pandas as pd
import anndata as ad
import scipy
from numpy import unique
import time
from FeaSc.util_mca import *
from FeaSc.pySaaSc import *
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, auc

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = [6, 4]
plt.rcParams['figure.dpi'] = 100

loading data set

adata = sc.read_h5ad("data/AML_GSE154109/AML_GSE154109.h5ad")
adata.obs.index = adata.obs.index.str.replace('@', '_')
adata.layers['counts'] = adata.X.copy()
adata
AnnData object with n_obs × n_vars = 10799 × 17111
    obs: 'UMAP_1', 'UMAP_2', 'Cluster', 'Celltype (malignancy)', 'Celltype (major-lineage)', 'Celltype (minor-lineage)', 'Patient', 'Sample', 'Tissue'
    layers: 'counts'
obs = adata.obs
obs.index = obs.index.str.replace('@', '_')
adata.obs['celltype'] = adata.obs['Celltype (major-lineage)'].astype('category')
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)

loading gene sets

gs = read_gmt('data/CD8_Tcells_activation.gmt')
grate = get_gene_rate(background_geneset='data/c2.cp.v2025.1.Hs.symbols.gmt', signature_geneset=gs, mode="multiple")
grate
background signature
A1BG 0.001243 0.0
A1CF 0.000746 0.0
A2M 0.003231 0.0
A3GALT2 0.000746 0.0
A4GALT 0.001740 0.0
ZWILCH 0.003231 0.0
ZWINT 0.003231 0.0
ZYG11B 0.000497 0.0
ZYX 0.001989 0.0
ZZZ3 0.001243 0.0

13868 rows × 2 columns

Cytokine signaling activity

To infer Cytokine signaling activity, a model matrix file that quantifies the relative changes in gene expression is required. Ridge regression is then applied, using gene expression as the dependent variable (Y) and the model matrix as the independent variable (X). The activity of each cytokine is quantified as the ratio of the regression coefficient to its standard error. This methodological approach was first introduced in CytoSig. FeaSc is capable of utilizing the model matrix file from CytoSig as well as those from alternative sources.

We aimed to demonstrate that dimension reduction techniques could improve the inference of cytokine activity. In this study, we implemented three distinct dimension reduction methods: Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), and Multiple Correspondence Analysis (MCA).

Benchmarking the performance of Feasc for cytokine activity inference presents significant challenges. It is widely accepted in the scientific community that CD8 T cell activation negatively correlates with TGFβ1 cytokine activity. This relationship has been previously utilized in the Tres method to identify immune-response related genes in T cells. Therefore, a stronger negative correlation between inferred TGFβ1 activity and CD8 T cell activation scores serves as an indicator of improved performance.

No dimension reduction

sc.pp.normalize_total(adata, target_sum=1e4, inplace=True)
sc.pp.log1p(adata)
response_data = compute_response(adata, gene_rate=grate, celltype="CD8T", signature="Tcell_activation")
response_data
Tcell_activation
Cell
P1_AACCGCGGTGATGTGG-1 -0.951382
P1_AACGTTGGTGACTACT-1 -2.726300
P1_AAGACCTGTCTTTCAT-1 -2.347866
P1_ACTGTCCCACACGCTG-1 -2.380796
P1_AGAGTGGCAAGCGTAG-1 -2.167345
P8_TTTATGCTCGCCAAAT-1 -1.363836
P8_TTTATGCTCTCAACTT-1 -2.808227
P8_TTTGGTTCAATCGGTT-1 -2.056600
P8_TTTGGTTCACGGTAAG-1 -2.253479
P8_TTTGGTTTCATAGCAC-1 -3.258121

715 rows × 1 columns

signaling_data = compute_signaling(
            adata,
            model_file="data/signature.centroid.expand",
            celltype='CD8T',
            cytokine=None,
            lambd=10000,
            num_permutations=0,
            test_method="two-sided")
signaling_data
Activin A BDNF BMP2 BMP4 BMP6 CD40L CXCL12 EGF FGF2 GCSF PGE2 TGFA TGFB1 TGFB2 TGFB3 TNFA TRAIL TWEAK VEGFA WNT3A
barcode
P1_AACCGCGGTGATGTGG-1 -0.094772 -0.293682 -0.286625 0.077783 0.321539 0.116699 -0.144668 -0.310968 1.021896 -1.132794 -0.924823 0.452272 -0.386532 0.331252 -0.301805 0.189115 1.498093 0.408660 0.558532 -0.821857
P1_AACGTTGGTGACTACT-1 0.126739 0.315829 -0.230301 -0.379226 0.598421 0.169795 0.568125 -0.144612 1.183636 -1.289984 -1.073571 0.428590 -0.362775 -0.124503 0.291687 0.079039 1.463567 0.411479 0.365396 -0.689990
P1_AAGACCTGTCTTTCAT-1 -0.089947 -0.193842 -0.100216 -0.095971 0.927968 -0.287058 0.299551 -0.349758 1.653981 -1.660125 -1.409793 0.317874 -0.481422 -0.086316 -0.104492 0.197171 1.128967 0.545923 0.446443 -1.467653
P1_ACTGTCCCACACGCTG-1 0.295517 -0.241921 -0.245211 -0.163259 0.422830 0.007753 0.506617 -0.212739 1.020806 -0.995910 -1.566678 0.093569 -0.455467 -0.024590 -0.294104 -0.009445 1.357538 0.279438 0.786081 -0.863656
P1_AGAGTGGCAAGCGTAG-1 -0.011922 -0.126875 -0.053377 -0.133999 0.066901 0.175580 0.506431 -0.064762 0.922227 -1.010561 -1.298723 0.618760 -0.378771 0.086789 -0.083260 -0.019148 0.693604 0.091681 0.427866 -0.364654
P8_TTTATGCTCGCCAAAT-1 -0.065202 -0.228488 -0.376433 -0.057831 0.423303 -0.128511 0.495901 0.019771 1.505175 -1.124849 -1.657007 0.215126 -0.505389 -0.146098 -0.303647 0.053365 1.203349 0.107695 0.381438 -0.620364
P8_TTTATGCTCTCAACTT-1 0.129540 0.195420 -0.336719 -0.331438 0.673912 -0.260759 0.546337 0.081230 1.139160 -1.068861 -1.367952 0.478208 -0.515014 -0.057076 -0.206613 0.102047 1.359984 0.416357 0.448254 -1.036446
P8_TTTGGTTCAATCGGTT-1 0.332880 -0.126464 -0.275887 -0.225908 0.591053 -0.142498 0.963675 -0.046590 1.022271 -0.726532 -1.043937 0.845506 -0.448924 -0.372296 0.035262 0.138179 1.515752 0.281205 1.034865 -0.951830
P8_TTTGGTTCACGGTAAG-1 0.219233 -0.393705 -0.091263 -0.196145 0.891786 0.132119 0.818010 0.180460 1.217418 -1.298357 -0.430545 0.652221 -0.479393 -0.109512 -0.081216 0.304150 1.533638 0.465998 1.190769 -1.575603
P8_TTTGGTTTCATAGCAC-1 0.625703 0.066704 0.232557 0.221179 0.287647 0.047044 1.100452 0.040469 1.422489 -0.853484 -1.356366 0.483683 -0.276560 -0.192638 0.019490 0.134252 1.408722 0.471552 0.564261 -0.829856

715 rows × 51 columns

res = pd.concat([response_data, signaling_data[["TGFB1"]]], axis=1)
res
Tcell_activation TGFB1
P1_AACCGCGGTGATGTGG-1 -0.951382 -0.386532
P1_AACGTTGGTGACTACT-1 -2.726300 -0.362775
P1_AAGACCTGTCTTTCAT-1 -2.347866 -0.481422
P1_ACTGTCCCACACGCTG-1 -2.380796 -0.455467
P1_AGAGTGGCAAGCGTAG-1 -2.167345 -0.378771
P8_TTTATGCTCGCCAAAT-1 -1.363836 -0.505389
P8_TTTATGCTCTCAACTT-1 -2.808227 -0.515014
P8_TTTGGTTCAATCGGTT-1 -2.056600 -0.448924
P8_TTTGGTTCACGGTAAG-1 -2.253479 -0.479393
P8_TTTGGTTTCATAGCAC-1 -3.258121 -0.276560

715 rows × 2 columns

plot_response_signaling({'No dimension reduction': res})
png

PCA-based method

adata = rebuild_matrix(adata, method=['pca'], n_dim=15)
adata
WARNING: adata.X seems to be already log-transformed.





AnnData object with n_obs × n_vars = 10799 × 17111
    obs: 'UMAP_1', 'UMAP_2', 'Cluster', 'Celltype (malignancy)', 'Celltype (major-lineage)', 'Celltype (minor-lineage)', 'Patient', 'Sample', 'Tissue', 'celltype', 'n_genes'
    var: 'n_cells'
    uns: 'log1p', 'active_method', 'active_dim', 'pca'
    obsm: 'pca'
    varm: 'pca'
    layers: 'counts'
response_data = compute_response(adata, gene_rate=grate, celltype="CD8T", signature="Tcell_activation")
signaling_data = compute_signaling(
            adata,
            model_file="data//signature.centroid.expand",
            celltype='CD8T',
            cytokine=None,
            lambd=10000,
            num_permutations=0,
            test_method="two-sided"
        )
res = pd.concat([response_data, signaling_data[["TGFB1"]]], axis=1)
res
Tcell_activation TGFB1
P1_AACCGCGGTGATGTGG-1 -0.656675 0.529193
P1_AACGTTGGTGACTACT-1 -2.900612 0.939177
P1_AAGACCTGTCTTTCAT-1 -0.554478 -0.271388
P1_ACTGTCCCACACGCTG-1 -0.863269 0.583958
P1_AGAGTGGCAAGCGTAG-1 -2.330548 1.219573
P8_TTTATGCTCGCCAAAT-1 -3.068340 0.753479
P8_TTTATGCTCTCAACTT-1 -0.941963 0.271050
P8_TTTGGTTCAATCGGTT-1 -2.345652 0.996785
P8_TTTGGTTCACGGTAAG-1 -1.669373 0.566825
P8_TTTGGTTTCATAGCAC-1 -3.064377 1.090217

715 rows × 2 columns

plot_response_signaling({'PCA-based method': res})
png

NMF-based method

adata = rebuild_matrix(adata, method=['nmf'], n_dim=15)
adata
WARNING: adata.X seems to be already log-transformed.





AnnData object with n_obs × n_vars = 10799 × 17111
    obs: 'UMAP_1', 'UMAP_2', 'Cluster', 'Celltype (malignancy)', 'Celltype (major-lineage)', 'Celltype (minor-lineage)', 'Patient', 'Sample', 'Tissue', 'celltype', 'n_genes'
    var: 'n_cells'
    uns: 'log1p', 'active_method', 'active_dim', 'pca', 'nmf'
    obsm: 'pca', 'nmf'
    varm: 'pca', 'nmf'
    layers: 'counts'
adata.uns.get('active_method', None)
['nmf']
response_data = compute_response(adata, gene_rate=grate, celltype="CD8T", signature="Tcell_activation")
signaling_data = compute_signaling(
            adata,
            model_file="data/signature.centroid.expand",
            celltype='CD8T',
            cytokine=None,
            lambd=10000,
            num_permutations=0,
            test_method="two-sided"
        )
res = pd.concat([response_data, signaling_data[["TGFB1"]]], axis=1)
plot_response_signaling({'PCA-based method': res})
png

MCA-based method

adata.X = adata.layers['counts']
adata = rebuild_matrix(adata, method=['mca'], n_dim=15)
adata
WARNING: adata.X seems to be already log-transformed.
run mca...
mca step 1: Construct the Fuzzy Matrix and Row Weights
mca step 1 completed: Fuzzy Matrix and Row Weights constructed
svd started
svd completed
mca step 2: Calculate Cell Coordinates and Feature Loadings
mca step 2 completed: Cell Coordinates and Feature Loadings calculated





AnnData object with n_obs × n_vars = 10799 × 17111
    obs: 'UMAP_1', 'UMAP_2', 'Cluster', 'Celltype (malignancy)', 'Celltype (major-lineage)', 'Celltype (minor-lineage)', 'Patient', 'Sample', 'Tissue', 'celltype', 'n_genes'
    var: 'n_cells'
    uns: 'log1p', 'active_method', 'active_dim', 'pca', 'nmf', 'mca'
    obsm: 'pca', 'nmf', 'mca'
    varm: 'pca', 'nmf', 'mca'
    layers: 'counts'
response_data = compute_response(adata, gene_rate=grate, celltype="CD8T", signature="Tcell_activation")
signaling_data = compute_signaling(
            adata,
            model_file="data/signature.centroid.expand",
            celltype='CD8T',
            cytokine=None,
            lambd=10000,
            num_permutations=0,
            test_method="two-sided"
        )
res = pd.concat([response_data, signaling_data[["TGFB1"]]], axis=1)
plot_response_signaling({'MCA-based method': res})
png

As demonstrated above, dimension reduction using MCA significantly enhanced the performance of cytokine activity inference, yielding a Pearson correlation coefficient (PCC) of -0.53, compared to 0.10 when no dimension reduction was applied.

Signalling pathway activities

A conventional approach for pathway activity inference relies on gene set scoring. However, this method has a significant limitation: most gene sets only include pathway member genes while excluding downstream target genes, making them inadequate for accurately assessing signaling pathway activity. Recognizing this gap, the SEED2 database has systematically compiled molecular signatures across 16 distinct signaling pathways. These signatures capture the specific gene expression alterations that occur when the pathway of interest is perturbed, thereby providing a more comprehensive representation of pathway dynamics. As a result, these curated signatures can be directly implemented in FeaSc to generate robust and reliable inferences of signaling pathway activities, offering researchers a more accurate tool for pathway functional assessment in single-cell.

df = pd.read_csv('data/SPEED2_signaling.tsv', sep='\t', nrows=0)
result = df.columns.tolist()
print(result)
['Estrogen', 'H2O2', 'Hippo', 'Hypoxia', 'IL-1', 'Insulin', 'JAK-STAT', 'MAPK', 'TLR', 'Notch', 'p53', 'PI3K', 'PPAR', 'RAR', 'TGFb', 'TNFa', 'Trail', 'VEGF', 'Wnt']
adata.uns.get('active_method', None)
['mca']
signaling_data = compute_signaling(
            adata,
            model_file="data/SPEED2_signaling.tsv",
            celltype=None,
            cytokine=None,
            lambd=10000,
            num_permutations=0,
            test_method="two-sided"
        )
res = pd.concat([signaling_data[["Wnt"]], adata.obs['celltype']], axis=1)
res
Wnt celltype
barcode
P1_AAATGCCAGACTAGAT-1 0.934152 B
P1_AAGGTTCTCAACGGGA-1 0.489379 B
P1_ACACCAAGTACCGGCT-1 0.615619 B
P1_ACATCAGGTTTAAGCC-1 0.873929 B
P1_ACTTGTTGTGGCGAAT-1 1.433766 B
P8_TTAGGCAAGTACGATA-1 0.498144 Mono/Macro
P8_TTCGGTCGTTCAGACT-1 -0.210558 Mono/Macro
P8_TTCTCAACAAGCGCTC-1 1.765080 Mono/Macro
P8_TTCTCAAGTGTGACCC-1 1.612464 Mono/Macro
P8_TTGACTTCATGGATGG-1 0.951047 Mono/Macro

10799 rows × 2 columns

median_order = res.groupby('celltype')['Wnt'].median().sort_values(ascending=False).index
colors = ['red'] * 3 + ['blue']*3
plt.figure(figsize=(5, 4))
sns.boxplot(data=res, x='celltype', y='Wnt', order=median_order, palette=colors)
plt.title('Wnt activity by Cell Type')
plt.xlabel('Cell Type')
plt.ylabel('Wnt activity')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
png